Statistics and Econometrics

Problem 1: Descriptive Statistics and Probability Theory: Real Data on CEO Compensation

CEO compensations data

salary  = 1999 salary + bonuses in 1000 US dollar
totcomp = 1999 CEO total compensation
tenure  = # of years as CEO (=0 if less than 6 months)
age     = age of CEO
sales   = total 1998 sales revenue of firm i
profits = 1998 profits for firm i
assets  = total assets of firm i in 1998

Data exploration

In [1]:
import pandas as pd
import numpy as np

Sample of data

In [2]:
df = pd.read_csv('ceo.csv')
df = df.drop('Unnamed: 7', axis=1)
df.head(5)
Out[2]:
salary totcomp tenure age sales profits assets
0 3030 8138 7 61 161315.0 2956.0 257389.0
1 6050 14530 0 51 144416.0 22071.0 237545.0
2 3571 7433 11 63 139208.0 4430.0 49271.0
3 3300 13464 6 60 100697.0 6370.0 92630.0
4 10000 68285 18 63 100469.0 9296.0 355935.0

Descriptive statistics

In [3]:
df.describe()
Out[3]:
salary totcomp tenure age sales profits assets
count 447.000000 447.000000 447.000000 447.000000 447.000000 447.000000 447.000000
mean 2027.516779 8340.058166 7.834452 56.469799 11557.780984 700.460850 27054.290828
std 1722.566389 31571.803005 8.246721 6.806641 16168.368902 1542.538013 64659.043191
min 100.000000 100.000000 0.000000 34.000000 2896.400000 -2669.000000 717.800000
25% 1084.000000 1575.500000 2.000000 52.000000 4184.150000 108.450000 3856.950000
50% 1600.000000 2951.000000 5.000000 57.000000 6704.000000 333.100000 7810.800000
75% 2347.500000 6043.000000 10.000000 61.000000 12976.800000 738.000000 21105.550000
max 15250.000000 589101.000000 60.000000 84.000000 161315.000000 22071.000000 668641.000000
In [4]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=4, cols=2, subplot_titles=(df.columns))

traces = []
for column_name in df.columns:
    traces.append(go.Histogram(x=df[column_name].values, name=column_name))

fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)
fig.append_trace(traces[2], 2, 1)
fig.append_trace(traces[3], 2, 2)
fig.append_trace(traces[4], 3, 1)
fig.append_trace(traces[5], 3, 2)
fig.append_trace(traces[6], 4, 1)

fig.update_layout(
    title_text='Distributions ', height=1000)
fig.show()

(A) For the variable totcomp compute the common location measures: mean, 5%- trimmed mean, median, upper and lower quartiles, the upper and lower 5%- quantiles. Give an economic interpretation for every location measure.

In [240]:
from scipy.stats import trim_mean

location_measure_result = [['Mean',round(df['totcomp'].mean(),3)], 
                           ['5%- trimmed mean',round(trim_mean(df['totcomp'].values, 0.05),3)],
                           ['Median', df['totcomp'].median()],
                           ['Quantile 25%',df['totcomp'].quantile(0.25)],
                           ['Quantile 75%',df['totcomp'].quantile(0.75)],
                           ['Quantile 5%',df['totcomp'].quantile(0.05)],
                           ['Quantile 95%',round(df['totcomp'].quantile(0.95),3)],
                           ['Min', round(df['totcomp'].min(),3)],
                           ['Max', round(df['totcomp'].max(),3)]]


html_text = "<table style='font-size: 11pt; width: 300px;'><tr><th>Location measure</th><th>Value</th></tr>"

for row in location_measure_result:
    html_text += '<tr>'
    for value in row:
        html_text += '<td>' + str(value) + '</td>'
    html_text += '</tr>'

html_text += "</table>"
In [238]:
display(HTML(html_text))
#pd.DataFrame(location_measure_result, columns=['Measure', 'value'])
Location measureValue
Mean8340.058
5%- trimmed mean4637.68
Median2951.0
Quantile 25%1575.5
Quantile 75%6043.0
Quantile 5%783.7
Quantile 95%24563.3
Min100
Max589101

The values of total compansations for CEO are in range from $100$ to $589101$ (min and max values).
Mean value is $8340.058$, median value is $2951.0$, which is far from mean value.
The reason is that we sample of data has a few large outliers (max value = $589101$, Quantile $95\%$ = $24563.3$)
So the mean value is skewed. If we trimm $5%$ from both sides - we get $5%$- trimmed mean value $4637.68$, which is closer to median value.
The most values are between $1575.5$ and $6043.0$ ($25\%$ and $75\%$ quantile)

(b) Plot the empirical cumulative distribution function. Compute and explain in economic terms the following quantities

$i)~\hat{F}^{-1}(0.1)$ and $\hat{F}^{-1}(0.9)$

$ii)~\hat{F}(2000)$ and $1-\hat{F}(4000)$

In [337]:
data = df['totcomp'].values
def ecdf(x):
    x = np.sort(x)
    def result(v):
        return np.searchsorted(x, v, side='right') / x.size
    return result
In [295]:
fig = go.Figure()
fig.add_scatter(x=np.unique(data), y=ecdf(data)(np.unique(data)), line_shape='hv')
#fig.add_scatter(x=[2000, 2000], y=[0,1], line_shape='hv')

#display(HTML('<h1 style="text-align: center">The empirical cumulative distribution function</h1>'))
fig.update_layout(title_text="<span style='font-size: 18pt;'>The empirical cumulative distribution function</b></span>")
fig.show()
In [407]:
html_text = '$\hat{F}^{-1}(0.1)=' + str(round(df['totcomp'].quantile(0.1),5))+'$ - this value is higher than 10% of observed values<br>'
html_text += '$\hat{F}^{-1}(0.9)=' + str(round(df['totcomp'].quantile(0.9),5))+'$ - this value is higher than 90% of observed values<br>'
html_text += '$\hat{F}(2000)=' + str(round(ecdf(data)(2000),5))+'$ - relative number of observations equal to or less than 2000<br>'
html_text += '$1-\hat{F}(4000)=' + str(round(1-ecdf(data)(2000),5))+'$ - relative number of observations higher than 4000<br>'

display(HTML(html_text))
$\hat{F}^{-1}(0.1)=1002.4$ - this value is higher than 10% of observed values
$\hat{F}^{-1}(0.9)=15046.0$ - this value is higher than 90% of observed values
$\hat{F}(2000)=0.34452$ - relative number of observations equal to or less than 2000
$1-\hat{F}(4000)=0.65548$ - relative number of observations higher than 4000

(c) Plot the histogram of totcomp and the Box-plot (or violin-plot). What can be concluded about the distribution of the data? Are the location measures computed above still appropriate? Compute and discuss an appropriate measure of symmetry.

In [293]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Histogram(x=df['totcomp'].values)])
fig.update_layout(title_text="<span style='font-size: 18pt;'>The histogram of <b>totcomp</b></span>")
fig.show()
In [296]:
import plotly.express as px
fig = px.violin(df, y='totcomp', box=True, # draw box plot inside the violin
                points='all', # can be 'outliers', or False
            )
fig.update_layout(title_text="<span style='font-size: 18pt;'>The violin plot of <b>totcomp</b></span>")
fig.show()
In [409]:
from scipy.stats import skew

print('Skew value: ', skew(df['totcomp'].values))
Skew value:  14.796488300344116

As we can see - our distribution is very skewed. We have one very large outlier.
Skewness value $14.79648 > 0$, which means that there is more weight in the right tail of the distribution.

(d) Check which method is used in your software to compute the optimal bandwidth (or the number of bars) in the histogram. Describe it shortly here. Make plots of too detailed and too rough histograms. What can we learn from these figures?

method is used in your software to compute the optimal bandwidth (or the number of bars) in the histogram

In [298]:
import plotly.express as px
fig = px.histogram(df, x='totcomp', nbins=20)
fig.update_layout(title_text="<span style='font-size: 18pt;'>The rough histogram of <b>totcomp</b></span>", height=500)
fig.show()
In [299]:
fig = px.histogram(df, x='totcomp', nbins=200)
fig.update_layout(title_text="<span style='font-size: 18pt;'>The detailed histogram of <b>totcomp</b></span>", height=500)
fig.show()

I haven't found what is the method of the bandwidth optimization in plotly. Probably, it can be the Shimazaki and Shinomoto's choice
The choice is based on minimization of an estimated $L_2$ risk function: $$argmin_h \frac{2\hat{m}-v}{h^2}$$

where $\bar{m}$ and $v$ are mean and biased variance of a histogram with bin-width $h$

$$\bar{m}=\frac{1}{k} \sum_{i=1}^{k} m_i$$$$v= \frac{1}{k} \sum_{i=1}^{k} (m_i - \bar{m})^2 $$

(e) There are methods which help us make the distribution of the sample more symmetric. Consider the natural logarithm of the total compensation: ln(totcomp). Plot the histogram (and Box-plot) and compare it with the figures for the original data. Compute the mean and the median. What can be concluded from the new values? Provide economic interpretation.

In [413]:
log_values = np.log(df['totcomp'].values)
log_df = pd.DataFrame(log_values, columns=['totcomp'])

mean_log = np.mean(log_values)
median_log = np.median(log_values)

print('After the natural logarithm compensation')
print('Mean : ', mean_log)
print('Median : ', median_log)
After the natural logarithm compensation
Mean :  8.135686093478537
Median :  7.989899374942939
In [418]:
fig = go.Figure(data=[go.Histogram(x=log_values, name='Histogram')])
fig.add_trace(go.Scatter(x=[mean_log,mean_log], y=[0,42], name='Mean'))
fig.add_trace(go.Scatter(x=[median_log,median_log], y=[0,42], name='Median'))

fig.update_layout(title_text="<span style='font-size: 18pt;'>The histogram after the natural logarithm compensation for <b>totcomp</b></span>", height=500)
fig.show()
In [302]:
fig = px.violin(log_df, y='totcomp', box=True, # draw box plot inside the violin
                points='all') # can be 'outliers', or False
fig.update_layout(title_text="<span style='font-size: 18pt;'>The violin plot after the natural logarithm compensation for <b>totcomp</b></span>", height=600)
fig.show()

As we can see, log function essentially deemphasizes very large values.
After such compensation, mean and median metrics become almost the same, that also indicates that our sample becomes more centrilized.

2. (a) We suspect that the total compensation of the CEO and other variables are related. Compute the correlation coefficients of Pearson and plot them as a heatmap (correlation map). Discuss the strength of the correlations.

In [419]:
import itertools
def plot_heat_matrix(cm, classes,
                          title='Pearson correlation',
                          cmap=plt.cm.YlGnBu, #
                          bar_title="Pearson correlation",
                         xlabel='',
                         ylabel='', show_numbers=False):
    """
    This function prints and plots the heatmap.
    """
    plt.figure(figsize=(12, 12))
    plt.imshow(cm, cmap=cmap)
    plt.title(title, fontsize=20)
    
    cb = plt.colorbar()
    cb.ax.set_title(bar_title, fontsize=18)
    
    tick_marks = np.arange(len(classes))    
    plt.xticks(tick_marks, classes, rotation='vertical', fontsize=18)
    plt.yticks(tick_marks, classes, fontsize=18)
    plt.ylim((-0.5,6.5))
    
    #plt.legend()
    if show_numbers:
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, round(cm[i, j], 4),
                     horizontalalignment="center", fontsize=14, color="white" if cm[i, j] > thresh else "black")
    
    plt.tight_layout()
#     plt.ylabel(ylabel, fontsize=18)
#     plt.xlabel(xlabel, fontsize=18)
In [306]:
display(HTML('<h2>Pearson correlation</h2>'))
df_cor_pearson = df.corr(method='pearson')
df_cor_pearson

Pearson correlation

Out[306]:
salary totcomp tenure age sales profits assets
salary 1.000000 0.322023 0.172449 0.119466 0.371256 0.370315 0.431036
totcomp 0.322023 1.000000 0.067998 0.016046 0.105999 0.139931 0.083237
tenure 0.172449 0.067998 1.000000 0.405901 -0.027743 -0.006263 0.057682
age 0.119466 0.016046 0.405901 1.000000 0.050350 0.049096 0.081185
sales 0.371256 0.105999 -0.027743 0.050350 1.000000 0.725970 0.518214
profits 0.370315 0.139931 -0.006263 0.049096 0.725970 1.000000 0.510889
assets 0.431036 0.083237 0.057682 0.081185 0.518214 0.510889 1.000000
In [420]:
plot_heat_matrix(df_cor_pearson.values, df_cor_pearson.columns,
                          title='Pearson correlation',
                          cmap=plt.cm.YlGnBu,
                          bar_title='Pearson correlation',
                         xlabel='',
                         ylabel='', show_numbers=True)

The correlation between total compensation and other variables is comparatively small.
The slight correlation we have only with salary.

What about other variables, the big correlation have sales and profits, which makes sense.

(b) Plot the scatter plots (pairs in R). Conclude if the linear correlation coefficients are appropriate here. Compute now the Spearman’s correlations and make a heatmap. Compare the results with Pearson. What is the rank of the observation totcomp= 6000?

In spearman correlation we use ranks. To compute Rank[$totcomp= 6000$], we sort values from our sample and count how many values are less then 6000. The result:

In [328]:
print('Rank[totcomp= 6000] = ',len(df[df['totcomp']<6000]))
Rank[totcomp= 6000] =  334
In [21]:
import seaborn as sns
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(df)

As we have large outliers - our scatter plots are flattened. outliers has big influense on Pearson correlation.

In [329]:
display(HTML('<h2>Spearman correlation</h2>'))
df_cor_spearman = df.corr(method='spearman')
df_cor_spearman 

Spearman correlation

Out[329]:
salary totcomp tenure age sales profits assets
salary 1.000000 0.707831 0.191974 0.149269 0.421091 0.465469 0.431009
totcomp 0.707831 1.000000 0.186276 0.127189 0.357656 0.433552 0.402117
tenure 0.191974 0.186276 1.000000 0.402980 -0.029301 0.005143 -0.008935
age 0.149269 0.127189 0.402980 1.000000 -0.030716 0.084714 0.084490
sales 0.421091 0.357656 -0.029301 -0.030716 1.000000 0.527095 0.581529
profits 0.465469 0.433552 0.005143 0.084714 0.527095 1.000000 0.635524
assets 0.431009 0.402117 -0.008935 0.084490 0.581529 0.635524 1.000000
In [421]:
plot_heat_matrix(df_cor_spearman.values, df_cor_spearman.columns,
                          title='Spearman correlation',
                          cmap=plt.cm.YlGnBu,
                          bar_title='Spearman correlation',
                         xlabel='',
                         ylabel='', show_numbers=True)

Spearman correlation is less sensetive for outliers, that's why we can see real correlation between totcomp and salary, which we cannot notice in Pearson correlation.
Also we can see moderate correlation between totcomp and sales, profits and assets, which also makes sense.

(c) Consider the two subsamples: CEOs younger than 50 and older than 50. Plot for both subsamples overlapping histograms/ecdf’s and discuss the results. What can we learn from the corresponding location and dispersion (!) measures?

In [332]:
young = df[df['age'] <= 50]['totcomp'].values
old = df[df['age'] > 50]['totcomp'].values
fig = go.Figure()
fig.add_trace(go.Histogram(x=old, name='age > 50'))
fig.add_trace(go.Histogram(x=young, name='age <= 50'))
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.update_layout(title_text="<span style='font-size: 18pt;'>The overlapping histograms for subsamples</span>", height=500)
fig.show()
In [336]:
fig = go.Figure()
fig.add_scatter(x=np.unique(old), y=ecdf(old)(np.unique(old)), line_shape='hv', name='age > 50')
fig.add_scatter(x=np.unique(young), y=ecdf(young)(np.unique(young)), line_shape='hv', name='age <= 50')

#fig.add_scatter(x=[2000, 2000], y=[0,1], line_shape='hv')

fig.update_layout(title_text="<span style='font-size: 18pt;'>The empirical cumulative distribution function</b></span>")
fig.show()
In [431]:
ages_values = [[young.mean(),old.mean()],[np.median(young),np.median(old)],[young.var(), old.var()]]


ages_df = pd.DataFrame(ages_values, columns=['age <= 50', 'age > 50'], index=['Mean', 'Median','Variance'])
ages_df['age <= 50'] = ages_df['age <= 50'].apply(lambda x: '%.3f' % x)
ages_df['age > 50'] = ages_df['age > 50'].apply(lambda x: '%.3f' % x)

ages_df
Out[431]:
age <= 50 age > 50
Mean 4489.090 9154.084
Median 1970.000 3161.000
Variance 85762429.261 1182852576.229
In [30]:
young.var()
Out[30]:
85762429.26117685
In [31]:
old.var()
Out[31]:
1182852576.2287145
In [435]:
young.var()/(old.var()+young.var())
Out[435]:
0.06760319631254766

For the age > 50 variance is significaly larger then for the age <= 50.
The median and mean values are also larger for CEO in age > 50.

Consider another grouping of the data. Define the groups:

image.png

(a) Aggregate the data to a 2 × 3 contigency table with absolute and with relative frequencies.

In [39]:
A1_S1 = df.query('age <= 50 & salary < 3000')['totcomp'].values
A1_S2 = df.query('age <= 50 & salary >= 3000 & salary < 5000')['totcomp'].values
A1_S3 = df.query('age <= 50 & salary >= 5000')['totcomp'].values

A2_S1 = df.query('age > 50 & salary < 3000')['totcomp'].values
A2_S2 = df.query('age > 50 & salary >= 3000 & salary < 5000')['totcomp'].values
A2_S3 = df.query('age > 50 & salary >= 5000')['totcomp'].values
In [339]:
table = [[A1_S1, A1_S2, A1_S3], [A2_S1, A2_S2, A2_S3]]

abs_table = []
for row in table:
    new_abs_row = []
    for values in row:
        new_abs_row.append(len(values))
    new_abs_row.append(sum(new_abs_row))
    abs_table.append(new_abs_row)
    
new_abs_row = []
for i in range(4):
    col_sum = 0
    for j in range(2):
        col_sum += abs_table[j][i]
    new_abs_row.append(col_sum)
abs_table.append(new_abs_row)

display(HTML("<h3>Absolute frequencies</h3>"))

pd.DataFrame(abs_table, columns=['S1','S2', 'S3', 'Total'], index=['A1','A2','Total'])

Absolute frequencies

Out[339]:
S1 S2 S3 Total
A1 75 3 0 78
A2 307 37 25 369
Total 382 40 25 447
In [340]:
display(HTML("<h3>Relative frequencies</h3>"))
pd.DataFrame(np.array(abs_table) / len(df), columns=['S1','S2', 'S3', 'Total'], index=['A1','A2','Total'])

Relative frequencies

Out[340]:
S1 S2 S3 Total
A1 0.167785 0.006711 0.000000 0.174497
A2 0.686801 0.082774 0.055928 0.825503
Total 0.854586 0.089485 0.055928 1.000000

(b) Give interpretation for the values of $n_{12}, h_{12}, n_1$ and $h_1$.

$n_{12} = 75$ - the number of CEOs older than 50 years, that have salary smaler then $3000$
$h_{12} = 0.167$ - the relative number of CEOs (in relation to all observed) older than 50 years, that have salary smaler then $3000$
$n_1 = 382$ - the number of CEOs that have salary smaler then $3000$
$h_1 = 0.854$ - the relative number of CEOs that have salary smaler then $3000$

(c) Compute an appropriate dependence measure for $S_i$ and $A_j$. What can be con-cluded from its value?

In [401]:
from scipy.stats import chi2_contingency

chi2, p, dof, ex = chi2_contingency(np.array(abs_table)[:2,:3], correction=False)
print('chi2: ', chi2)
print('p: ', p)
chi2:  9.29826638213019
p:  0.009569893605658778

Problem 2: Descriptive Statistics and Probability Theory: Simulated Data

1. Simulate (with a fixed seed) a sample of size $n = 100$ from the normal distribution with $μ_1 =10$ and $σ_1^2 =9$.

(a) Plot the histogram and compare it to the density of $N(10,9)$.

In [351]:
np.random.seed(42)

mu, sigma = 10, 9 # mean and standard deviation
normal_distribution = np.random.normal(mu, sigma, 100)

fig = go.Figure(data=[go.Histogram(x=normal_distribution, nbinsx=20)])
fig.update_layout(title_text="<span style='font-size: 18pt;'>The histogram for the normal distribution</span>", height=500)
fig.show()

(b) Now draw a sample yi of size n = 100 from t5. Transform it as follows: $10 + 3\sqrt{3/5} · y_i$ . Plot the histogram and compare the density of $N (10, 9)$. What can be concluded and why this example might be relevant for empirical studies?

In [350]:
import scipy.stats as sts

student_distribution = sts.t.rvs(df=5, size=100)
modified_distribution = student_distribution*(3*np.sqrt(3/5)) + 10

fig = go.Figure(data=[go.Histogram(x=modified_distribution, nbinsx=20)])
fig.update_layout(title_text="<span style='font-size: 18pt;'>The histogram for the modified Student distribution</span>", height=500)
fig.show()
In [352]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=normal_distribution, nbinsx=20, name='Normal distribution'))
fig.add_trace(go.Histogram(x=modified_distribution, nbinsx=20, name='Modified student distribution'))

fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.update_layout(title_text="<span style='font-size: 18pt;'>Comparing histograms for the normal and modified Student distributions</span>", height=500)
fig.show()

My text

2. In practice the data is always very heterogenous. To reflect it we contaminate the data by adding an outlier or a subsample with different characteristics.

(a) To obtain a realistic heterogenous sample add to the original normal data a new sample of size m simulated from $N(20,22)$, i.e. $μ_2 = 20$ and $σ_2 = 4$. The size m will obviously influence the above measures. Vary $m$ from $10$ to $200$. (The resulting sample is said to stem from a mixture normal distribution).

In [86]:
mu, sigma = 20, 4 # mean and standard deviation
m_values = [10,50,100,150,200]
add_distributions = []
for m in m_values:
    add_distributions.append(np.random.normal(mu, sigma, m))

(b) Plot Box-plots (or violin plots) and histograms for each subsample individually and for the sample for a few different values of $m$.

In [354]:
from IPython.display import HTML

fig = go.Figure()

for i in range(len(m_values)):
    fig.add_trace(go.Violin(x =[m_values[i]]*len(add_distributions[i]),
                            y=add_distributions[i],
                            name='m='+str(m_values[i]),
                            box_visible=True,
                            meanline_visible=True))
    
fig.update_layout(title_text="<span style='font-size: 18pt;'>Violin plots for each subsample of size m</span>")
fig.show()
In [355]:
fig = make_subplots(rows=5, cols=2) #, subplot_titles=m_values

traces = []
for i in range(5):
    traces.append(go.Histogram(x=add_distributions[i], nbinsx=20, name='m='+str(m_values[i])))
    traces.append(go.Violin(x=add_distributions[i], box_visible=True,
                               meanline_visible=True, name='m='+str(m_values[i])))
    

fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)
fig.append_trace(traces[2], 2, 1)
fig.append_trace(traces[3], 2, 2)
fig.append_trace(traces[4], 3, 1)
fig.append_trace(traces[5], 3, 2)
fig.append_trace(traces[6], 4, 1)
fig.append_trace(traces[7], 4, 2)
fig.append_trace(traces[8], 5, 1)
fig.append_trace(traces[9], 5, 2)


fig.update_layout(title_text="<span style='font-size: 18pt;'>Distributions for samples with differnt size m</span>", height=1600)
fig.show()
In [119]:
import copy
In [120]:
distributions_with_adds = []
for i in range(5):
    new_distribution_with_add = copy.deepcopy(list(normal_distribution))
    new_distribution_with_add.extend(list(add_distributions[i]))
    distributions_with_adds.append(new_distribution_with_add)
In [356]:
fig = go.Figure()

for i in range(len(m_values)):
    fig.add_trace(go.Violin(y=distributions_with_adds[i],
                            name='m='+str(m_values[i]),
                            box_visible=True,
                            meanline_visible=True))
    
fig.update_layout(title_text="<span style='font-size: 18pt;'>Violin plots for modified distributions</span>")
fig.show()
In [357]:
fig = make_subplots(rows=5, cols=2) #, subplot_titles=m_values

traces = []
for i in range(5):
    traces.append(go.Histogram(x=distributions_with_adds[i], nbinsx=20, name='m='+str(m_values[i])))
    traces.append(go.Violin(x=distributions_with_adds[i], box_visible=True,
                               meanline_visible=True, name='m='+str(m_values[i])))
    

fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)
fig.append_trace(traces[2], 2, 1)
fig.append_trace(traces[3], 2, 2)
fig.append_trace(traces[4], 3, 1)
fig.append_trace(traces[5], 3, 2)
fig.append_trace(traces[6], 4, 1)
fig.append_trace(traces[7], 4, 2)
fig.append_trace(traces[8], 5, 1)
fig.append_trace(traces[9], 5, 2)


fig.update_layout(title_text="<span style='font-size: 18pt;'>Modified distribution by samples with differnt size m</span>", height=1600)
fig.show()

(c) Make animated or interactive graphics (with manipulate, plotly, ggplot,etc.) to visualize the impact of $m$ on the histogram and location measures (added as vertical lines in the graph) of the data.

In [380]:
medians = []
means = []
heights = [14,16,25,35,42]

for i in range(len(distributions_with_adds)):
    medians.append(np.median(np.array(distributions_with_adds[i])))
    means.append(np.mean(np.array(distributions_with_adds[i])))
In [387]:
fig = go.Figure()

for i in range(5):
    fig.add_trace(go.Histogram(x=distributions_with_adds[i], nbinsx=25, name='Histogram for m='+str(m_values[i])))
    fig.add_trace(go.Scatter(x=[means[i],means[i]], y=[0,heights[i]], name='Mean for m='+str(m_values[i])))
    
fig.data[0].visible = True

steps = []
for i in range(int(len(fig.data)/2)):
    step = dict(
        method="restyle",
        args=["visible", [False] * len(fig.data)],
        label='m='+str(m_values[i])
    )
    step["args"][1][i*2] = True
    step["args"][1][i*2+1] = True
    # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active=1,
    currentvalue={"prefix": "m: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders
)

fig.update_layout(title_text="<span style='font-size: 18pt;'>Modified distribution by samples with differnt size m</span>")

fig.show()

3. Next step is to simulate two dependent data sets. We simulate two samples with a given value of the correlation coefficient.

(a) Let $U ∼N(0,1)$ and $V ∼N(0,1)$. Let$U^∗ =U$ and $V^∗ =ρU+\sqrt{1−ρ2}V$. Prove that $Corr(U^∗,V^∗)=ρ$ and the variances of both variables $U^∗$ and $V^∗$ equal one.

In [ ]:
 

(b) Use the above idea to simulate two samples of size $n = 100$ from a normal distribution with different values of $ρ$. Compute the correlation coefficients of Pearson and of Spearman. Compare the correlation to the original parameter $ρ$ (for example, plot Pearson vs. $ρ$ and Spearman vs. $ρ$)

In [181]:
import scipy.stats as sts
np.random.seed(4)

mu, sigma = 0, 1 # mean and standard deviation
U = np.random.normal(mu, sigma, 100)
V = np.random.normal(mu, sigma, 100)

p_values = np.arange(0, 1.1, 0.1)

modified_V = []
for p in p_values:
    V_ = p*U + np.sqrt(1-p*p)*V
    modified_V.append(V_)
In [182]:
pearson_values = []
spearman_values = []
for i in range(len(modified_V)):
    pearson_values.append(np.corrcoef(U, modified_V[i])[0,1])
    spearman_values.append(sts.spearmanr(U, modified_V[i])[0])
In [394]:
fig = make_subplots(rows=1, cols=2, subplot_titles=("Pearson / p", "Spearman / p")) #, subplot_titles=m_values

traces = []
traces.append(go.Scatter(x=p_values,y=pearson_values, name="Pearson / p"))
traces.append(go.Scatter(x=p_values,y=spearman_values, name="Spearman / p"))
    

fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)

fig.update_xaxes(title_text="P", row=1, col=1)
fig.update_xaxes(title_text="P", row=1, col=2)

fig.update_yaxes(title_text="Pearson", row=1, col=1)
fig.update_yaxes(title_text="Spearman",row=1, col=2)

fig.update_layout(title_text="<span style='font-size: 18pt;'>Comparing correlations</span>")

fig.show()

(b) Make a nonlinear but monotone transformation of $V^∗$, say $exp$ for simplicity. Check the impact of this transformation on the correlation coefficients of Spearman and Pearson. Think about an appropriate visualization of the findings.

In [186]:
modified_V_exp = np.exp(modified_V)
In [195]:
pearson_values_exp = []
spearman_values_exp = []
for i in range(len(modified_V_exp)):
    pearson_values_exp.append(np.corrcoef(U, modified_V_exp[i])[0,1])
    spearman_values_exp.append(sts.spearmanr(U, modified_V_exp[i])[0])
In [395]:
fig = make_subplots(rows=1, cols=2, subplot_titles=("Pearson / p", "Spearman / p")) #, subplot_titles=m_values

traces = []
traces.append(go.Scatter(x=p_values,y=pearson_values_exp, name="Pearson / p"))
traces.append(go.Scatter(x=p_values,y=spearman_values_exp, name="Spearman / p"))
    

fig.append_trace(traces[0], 1, 1)
fig.append_trace(traces[1], 1, 2)

fig.update_xaxes(title_text="P", row=1, col=1)
fig.update_xaxes(title_text="P", row=1, col=2)

fig.update_yaxes(title_text="Pearson", row=1, col=1)
fig.update_yaxes(title_text="Spearman",row=1, col=2)

fig.update_layout(title_text="<span style='font-size: 18pt;'>Comparing correlations after exponatial transformation</span>")

fig.show()